Estimate: a realized value of \(W(\mathbf{x})\) applied to an actual sample, \(\mathbf{x}\)
\(\bar{x} = W(176,177,175,179,173) = 176\)
Evaluating an Estimator
The goal of an estimator is to estimate the estimand well.
This is important because it’s what allows us to make inferences about a population parameter based on a sample statistic. This is the core of inference. Good estimators will be:
Unbiased
Consistent
Efficient
Evaluating an Estimator: Unbiasedness
Intuitive Idea: our estimator, \(\hat{\theta}\), should not systematically mis-estimate \(\theta\)
from Scott Fortmann-Roe
Evaluating an Estimator: Unbiasedness
Math Idea: an estimator \(\hat{\theta}\) is unbiased if

\[
E[\hat{\theta}] = \theta
\]

Thus, the sample mean \(\hat{\mu}\) is an unbiased estimator of the population mean \(\mu\)
Note we’re using \(\hat{}\) in this section (pronounced “hat”) for consistency. Common estimators like the sample mean often have their own symbols like \(\bar{x}\) which you’ll also see used. In general, putting a \(\hat{}\) on something means that we’re creating an estimate of it.
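A quick simulation makes the unbiasedness of the sample mean concrete. The population values here (\(\mu = 176\), \(\sigma = 7\)) are illustrative, not from the slides:

```r
# Draw many samples and check that the sample means center on mu
set.seed(42)
mu <- 176    # hypothetical population mean (illustrative)
sigma <- 7   # hypothetical population sd (illustrative)

# 10,000 samples of size 25; record each sample mean
sample_means <- replicate(10000, mean(rnorm(25, mean = mu, sd = sigma)))

mean(sample_means)  # average of the estimates lands very close to 176
```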
Evaluating an Estimator: Unbiasedness
Population Variance = \(\frac{1}{n}\sum(x_i-\mu)^2\)

Sample Variance = \(\frac{1}{n}\sum(x_i-\bar{x})^2\)
Sample Variance
# Function to demonstrate bias in sample variance estimation
demonstrate_variance_bias <- function(n = 10, true_mean = 0, true_var = 1, n_sims = 1000) {
  # Initialize vectors to store estimates
  var_biased <- numeric(n_sims)
  var_unbiased <- numeric(n_sims)
  # Run simulations
  for (i in 1:n_sims) {
    # Generate random normal data
    x <- rnorm(n, mean = true_mean, sd = sqrt(true_var))
    # Biased estimator: divide by n
    var_biased[i] <- sum((x - mean(x))^2) / n
    # Unbiased estimator: divide by n-1
    var_unbiased[i] <- sum((x - mean(x))^2) / (n - 1)
  }
  return(data.frame(biased = var_biased, unbiased = var_unbiased))
}
Thus, the sample variance \(\hat{\sigma}^2\) (computed with \(\frac{1}{n}\)) is a biased estimator of the population variance \(\sigma^2\)
Note: this is why, when calculating the sample variance, we divide by \(n-1\) instead of \(n\). Intuitively this makes sense: we first use the data to estimate the sample mean \(\hat{\mu}\), losing 1 degree of freedom, and then reuse that estimate when computing \(\hat{\sigma}^2\)
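A self-contained sketch of the same simulation idea: averaging the divide-by-\(n\) and divide-by-\(n-1\) estimators over many samples shows the first one undershooting the true variance of 1 (by the factor \(\frac{n-1}{n}\)) while the second centers on it:

```r
# Compare the biased (divide-by-n) and unbiased (divide-by-n-1)
# variance estimators against a true variance of 1
set.seed(1)
n <- 10
n_sims <- 10000
ests <- replicate(n_sims, {
  x <- rnorm(n)  # standard normal: true variance = 1
  c(biased   = sum((x - mean(x))^2) / n,
    unbiased = sum((x - mean(x))^2) / (n - 1))
})

rowMeans(ests)  # biased averages near (n-1)/n = 0.9; unbiased near 1
```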
Evaluating an Estimator: Consistency
Intuitive Idea: as we collect more data (information) the estimator should approximate the estimand more closely.
If we could have \(\infty\) information, our estimator should spit out estimates equal to the estimand.
Evaluating an Estimator: Consistency
Math Idea:
\[
\lim_{n \to \infty} \hat{\theta} = \theta
\]
Example:
Sample Mean: The Law of Large Numbers guarantees that:
\[
\lim_{n \to \infty} \hat{\mu} = \mu
\]
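A small simulation (illustrative, standard normal population) shows this concentration: the spread of the sampling distribution of \(\bar{X}_n\) shrinks as \(n\) grows:

```r
# The spread of the sampling distribution of the mean shrinks as n grows
set.seed(7)
sample_sizes <- c(10, 100, 1000)
spread <- sapply(sample_sizes,
                 function(n) sd(replicate(2000, mean(rnorm(n)))))

round(spread, 3)  # roughly 1/sqrt(n): each tenfold n cuts the spread ~3x
```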
Evaluating an Estimator: Consistency
Weak Law of Large Numbers
Intuitive Idea: as you collect more and more independent, random samples of \(\mathbf{X}\), the sample mean of \(\mathbf{X}\) will get closer and closer (and eventually converge) to its expected value.
Math Idea: for all \(\epsilon > 0\), if \(\sigma^2 < \infty\),

\[
P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}
\]
As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)
Note: we’re using Chebyshev’s Inequality here, which states that \(P(|X-\mu| \geq k) \leq \frac{\sigma^2}{k^2}\)
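We can check the bound empirically. This sketch uses a standard normal population (\(\mu = 0\), \(\sigma^2 = 1\)) with illustrative values of \(n\) and \(\epsilon\); the observed tail probability sits comfortably under \(\frac{\sigma^2}{n\epsilon^2}\):

```r
# Empirically verify P(|Xbar - mu| >= eps) <= sigma^2 / (n * eps^2)
set.seed(123)
n <- 50
eps <- 0.3
xbars <- replicate(10000, mean(rnorm(n)))  # mu = 0, sigma^2 = 1

empirical <- mean(abs(xbars) >= eps)  # observed tail probability
bound <- 1 / (n * eps^2)              # Chebyshev / WLLN bound

c(empirical = empirical, chebyshev_bound = bound)
```

Chebyshev is a loose bound, so the empirical probability is typically far below it; the point is only that it can never exceed it.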
Evaluating an Estimator: Efficiency
Intuitive Idea: the estimate we get should have the smallest variance possible (so that we can be more confident about our estimate with as little data as possible)
Evaluating an Estimator: Efficiency
Math Idea:
\[
Var(\hat{\theta}) \geq \frac{1}{I(\theta)}
\]
where \(I(\theta)\) is the Fisher Information for \(\theta\)
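As a concrete check of the bound (a sketch with illustrative parameters): for \(X \sim \mathcal{N}(\mu, \sigma^2)\) with \(\sigma\) known, \(I(\mu) = \frac{n}{\sigma^2}\), so the bound is \(\frac{\sigma^2}{n}\), and the sample mean attains it exactly:

```r
# For X ~ N(mu, sigma^2) with sigma known, the Cramer-Rao bound for
# estimating mu is sigma^2 / n, and the sample mean attains it
set.seed(99)
n <- 20
sigma <- 2
xbars <- replicate(20000, mean(rnorm(n, mean = 5, sd = sigma)))

c(simulated_var = var(xbars),          # variance of the estimator
  cramer_rao_bound = sigma^2 / n)      # both near 0.2
```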
Evaluating an Estimator: Efficiency
Fisher Information
Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)
Imagine that everyone in Room A has the same number of cats (\(\mu_A\))
Imagine that cat ownership in Room B is defined as \(Pois(\mu_B)\) where \(\mu_B\) is the mean number of cats owned in Room B
In which Room do I learn more about \(\mu\) by asking one person how many cats they own?
Evaluating an Estimator: Efficiency
Fisher Information
Intuitive Idea: the amount of information that a sample from a random variable \(\mathbf{X}\) can give us about a parameter \(\theta\)
Imagine that Room A has \(\text{height}_{cm} \sim \mathcal{N}(\mu_A, 8)\)
Imagine that Room B has \(\text{height}_{cm} \sim \mathcal{N}(\mu_B, 1)\)
In which Room do I learn more about \(\mu\) by asking one person their height?
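A simulation of the two rooms (treating the second parameter of \(\mathcal{N}\) as the standard deviation, and using an illustrative \(\mu = 170\)) shows why one answer from Room B is more informative: a single height drawn there scatters far less around \(\mu\):

```r
# One observation from Room B (sd = 1) pins down mu much better than
# one from Room A (sd = 8)
set.seed(2)
mu <- 170  # hypothetical mean height in cm (illustrative)
room_a <- rnorm(5000, mean = mu, sd = 8)  # each draw = one person's height
room_b <- rnorm(5000, mean = mu, sd = 1)

c(sd_room_a = sd(room_a), sd_room_b = sd(room_b))  # ~8 vs ~1
```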
Evaluating an Estimator: Efficiency
Fisher Information
Slightly Mathy Idea: Fisher Information measures how sensitive the log-likelihood function \(\ell(\theta | X)\) is to changes in \(\theta\) (more sensitive \(\to\) more information)
\[
I(\theta) = -E\left[\frac{\partial^2}{\partial\theta^2} \ell(\theta | X)\right]
\]

where \(\ell(\theta | X)\) is the log-likelihood of \(\theta\) given \(X\). If \(\ell\) is sensitive to changes in \(\theta\), the second derivative will be large in magnitude and we expect to see high information
Note: usually \(\ell\) is concave down around maximum likelihood estimate, meaning that the second-derivative will be negative, hence the negative sign in front of the expectation
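As a worked example (a standard result, not from the slides): for \(X_1, \dots, X_n \sim \mathcal{N}(\mu, \sigma^2)\) with \(\sigma^2\) known,

\[
\ell(\mu | X) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2,
\qquad
\frac{\partial^2 \ell}{\partial \mu^2} = -\frac{n}{\sigma^2}
\]

so \(I(\mu) = -E\left[\frac{\partial^2 \ell}{\partial \mu^2}\right] = \frac{n}{\sigma^2}\): smaller \(\sigma^2\) means more information per observation, matching the Room A vs. Room B intuition.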
Estimators Wrap-Up
Estimators take in sample data and produce a sample estimate
Good estimators produce estimates that allow us to make inferences about population parameters
Unbiased, Consistent, Efficient
These estimates are (so far) individual numbers that are guesses for population parameters
Point Estimates vs. Interval Estimates
Point Estimate: a single value calculated based on a sample that estimates a population parameter
Interval Estimate: a range of values calculated based on a sample that estimate a population parameter with uncertainty
E.g. “the mean height of Michaels is 178cm” vs. “the mean height of Michaels is between 175cm and 181cm”
Point Estimates vs. Interval Estimates
Think about the research or industry work you’ve done. When would interval estimates have been helpful?
Point Estimates vs. Interval Estimates
Think about the research or industry work you’ve done. When would interval estimates have been helpful?
My Story: the clients who didn’t report uncertainty…
Inference: Second Problem
In the last class, we talked about the first problem of inference: data is too complex to reason about, we need summaries. But now that we’ve exhaustively discussed the problem of point estimates, we run into the second problem of inference…
Uncertainty.
Inference: Second Problem
Claim: My mean crossword time is faster than yours.
my time: 25m 05s
your time: 25m 23s
Is this 👆 enough to convince you that my mean time is faster than yours? Why/Why not?
In the past month, it’s rained on 9 of the 30 days. 🌧️
Frequentist: the probability of rain is \(q = \frac{9}{30} = 0.3\)
Bayesian: before seeing the data, values of \(q\) near \(0.1\) sounded the most reasonable based on my knowledge of California. After seeing the data, I think the probability of rain \(q\) is most likely \(0.25\) but there’s a lot of uncertainty.
Frequentist Uncertainty
In a frequentist analysis, uncertainty represents sampling variability: the difference between estimates on repeated, similar samples.
💡 if i took a bunch of random samples exactly like this one, how variable would my estimates be?
Frequentist Uncertainty
If I take repeated random samples of 2 people from my list of J’s, what are the different mean ages that I get?
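A sketch of this resampling exercise. The ages below are made up for illustration (the actual “list of J’s” isn’t in the slides):

```r
# Hypothetical ages for the "list of J's" (illustrative, not the real data)
set.seed(3)
ages <- c(24, 31, 27, 45, 38, 29, 52, 33)

# Repeatedly draw 2 people and record the mean age of each pair
mean_ages <- replicate(5000, mean(sample(ages, size = 2)))

range(mean_ages)  # the sampling variability across repeated samples of 2
```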
Everyone run this code 10 times, and put the proportions of heads in this sheet.
Sampling Distributions: Example
Sampling Distribution of \(q\)
How much uncertainty do we have about \(\hat{q}\)?
Sampling Distributions: Example
Sampling Distribution of \(q\)
What is the range of \(\hat{q}\)s that covers 90% of samples?
What is our best guess for what \(\hat{q}\) is?
Sampling Distributions: Example
Sampling Distribution of \(q\)
What is the range of \(\hat{q}\)s that covers 90% of samples?
library(ggplot2)

# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))
# calculate 5th and 95th percentile
ci <- quantile(coin_flips, c(0.05, 0.95))
# plot
ggplot(data = data.frame(x = coin_flips), aes(x = x)) +
  geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") +
  xlim(c(0.2, 0.8)) +
  geom_segment(x = ci[[1]], xend = ci[[2]], y = -1, yend = -1, linewidth = 2) +
  labs(x = expression(hat(q)), y = "",
       title = "Sampling Distribution of Sample Prop")
Sampling Distributions: Example
Sampling Distribution of \(q\)
What is our best guess for what \(\hat{q}\) is?
library(ggplot2)

# simulated sampling dist
coin_flips <- replicate(10000, mean(sample(0:1, size = 100, replace = TRUE)))
# calculate mean
mu <- mean(coin_flips)
# plot
ggplot(data = data.frame(x = coin_flips), aes(x = x)) +
  geom_histogram(binwidth = 0.02, fill = "blue", color = "darkgray") +
  xlim(c(0.2, 0.8)) +
  geom_vline(xintercept = mu, linewidth = 2) +
  labs(x = expression(hat(q)), y = "",
       title = "Sampling Distribution of Sample Prop")
Sampling Distribution: Analytical
So far, we have used Monte Carlo simulations to approximate the sampling distribution. But often we can calculate it directly instead.
Claim: Sampling Distributions of sample means will (often) be a Normal Distribution, and we can use what we know about Normal Distributions to calculate our point estimate (best guess) and our interval estimates (uncertainty).
Review: LLN
For a random variable \(X\) with finite variance \(\sigma^2\) and expected value \(\mu\),

\[
P(|\bar{X}_n - \mu| \geq \epsilon) \leq \frac{\sigma^2}{n\epsilon^2}
\]

As \(n \to \infty\), \(\frac{\sigma^2}{n\epsilon^2} \to 0\). So the probability that \(|\bar{X_n} - \mu| \geq \epsilon\) goes to \(0\). Thus, \(P(|\bar{X_n} - \mu| < \epsilon) \to 1\)
💡 In other words, \(\bar{X}_n \to \mu\), as \(n \to \infty\). The larger our sample, the more concentrated the sampling distribution will be around \(\mu\)
Central Limit Theorem
Note: this is the Central Limit Theorem:
Let \(\mathbf{X}\) be a random variable with finite variance \(\sigma^2\). As \(n \to \infty\), the distribution of sample means will be distributed as a normal distribution:

\[
\bar{X}_n \sim \mathcal{N}\left(\mu, \frac{\sigma^2}{n}\right)
\]
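The remarkable part is that this holds even when \(\mathbf{X}\) itself is far from normal. A sketch using a heavily skewed Exponential(1) population (mean 1, variance 1; parameters are illustrative):

```r
# CLT demo: means of samples from a skewed Exponential(1) population
# still behave like N(mu, sigma^2 / n) for moderate n
set.seed(8)
n <- 50
means <- replicate(10000, mean(rexp(n, rate = 1)))  # true mean 1, variance 1

c(mean = mean(means),        # centers on mu = 1
  sd = sd(means),            # matches sigma / sqrt(n)
  clt_sd = 1 / sqrt(n))
```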